A cross-verified database of notable people, 3500BC-2018AD

A new strand of literature aims at building the most comprehensive and accurate database of notable individuals. We collect a massive amount of data from various editions of Wikipedia and Wikidata. Using deduplication techniques over these partially overlapping sources, we cross-verify each piece of retrieved information. For some variables, Wikipedia adds 15% more information when it is missing in Wikidata. We find very few errors in the part of the database that contains the most documented individuals, but nontrivial error rates at the bottom of the notability distribution, due to sparse information and classification errors or ambiguity. Our strategy results in a cross-verified database of 2.29 million individuals (an elite of 1/43,000 of the human beings who have ever lived), including a third who are not present in the English edition of Wikipedia. Data collection is driven by specific social science questions on gender, economic growth, urban and cultural development. We document an Anglo-Saxon bias present in the English edition of Wikipedia, and show when it matters and when it does not.

they would be in the set we called F. We can, however, approximate the rate at which people survived to make it into the final dataset, under the assumption that the fraction of notable people affecting society at the time they lived is a constant share of the living population at that time, and that they are forgotten at a constant rate per unit of time. This is of course a pure thought experiment, but it provides an order of magnitude of the number of notable individuals we may still be missing.

A.2 Removing duplicates: details

Dealing with possible duplicates is not an easy task, as we need to separate these cases from real homonyms, i.e. individuals sharing exactly the same name and first name. We use a total of eleven methods, all detailed below, ranging from string normalization and phonetic encoding to string distance metrics, to identify likely duplicate pairs that we eventually decide to merge by manually checking their respective Wikipedia biographies. In order to reduce the number of candidates, which is prohibitively large in our database, we determine a score for each candidate based on additional features such as common birth or death dates, citizenship, and the domains of influence retrieved from these questionable biographies. This helps us discard candidate pairs which are not duplicates.

We then construct a score ranging from 0 to 1 which corresponds to the likelihood that a set of biographies corresponds to the same individual. A score above 0.75 for 4 criteria and above 0.8 for the remaining two identifies a 'cluster' of individuals who have a high probability of being the same person; we keep the person with the largest amount of available biographical information. We identify 34,562 true duplicates, that is 0.7% of the total number of individuals (34,562/4,678,040).

We use the following methods to remove duplicates (a code sketch of two of them follows the list):

1. Connected components solving: sometimes links between Wikipedia biographies are not mutual. It is therefore possible, by gathering connected components of the graph of lowercase page names, to find suitable duplicate pairs.

4. String fingerprinting: there is a large variety of ways to write the same name. It is not rare, for instance, to see Asian names written in the incorrect order by occidental clerks. String fingerprinting is a method which applies a set of transformations to a string to normalize order, redundancy and case, so one can match similar-looking strings.

5. Squeezed string fingerprinting: same as before, except that we "squeeze" consecutive duplicate letters into a single one. For instance, the name "Brettner" would become "Bretner". This follows the observation that double letters tend not to be well respected across variants of the same name.

6. Small tokens filtering: small tokens composed of only one or two characters, such as "de" or "of", and stopwords tend to be frequently forgotten in names. Filtering them produces some more matches.

7. Rusalka phonetic encoding: by producing a symbolic phonetic representation of the considered names, one is often able to match different transliterations or spellings.

8. Sorted neighborhood using the omission key and a Levenshtein distance less than or equal to one: string distances such as the Levenshtein distance are very useful to find similar-looking strings. Unfortunately, a naive approach to collecting pairs of duplicates in a dataset results in quadratic processing time. While this is acceptable for tiny datasets, it is not for millions of names. The sorted neighborhood method approximates the pairwise computation: if you order strings using a specific key beforehand, similar pairs have a high probability of being close in the sorted list. A fixed-size window is then slid across the sorted list, within which pairwise distances are computed and similar pairs reported. We first sort our dataset using the omission key, a string key leveraging the frequency with which characters are omitted when misspelling words, before collecting pairs with a very low Levenshtein distance.

9. Sorted neighborhood using the skeleton key and a Levenshtein distance less than or equal to one: same as before but using a different key, the skeleton key, leveraging the way words tend to be misspelled in the English language, i.e. misspelled consonants are frequently not the first ones.

10. Cologne phonetic encoding: this phonetic encoding specifically targets German and similar languages and is a good complement to the Rusalka one. Its precision is very low, however, since it approximates sounds heavily.

11. Sorted neighborhood using the skeleton key and a Levenshtein distance less than or equal to two.

Further references: 1-4.
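As an illustration, the following minimal sketch combines the fingerprinting passes (methods 4 and 5) with the sorted-neighborhood idea of method 8. It is a simplified stand-in for the pipeline, not the pipeline itself: the sort key is the fingerprint rather than the omission or skeleton key, the window size is arbitrary, and all names are held in memory.

```python
# Sketch of fingerprinting + sorted-neighborhood deduplication.
import unicodedata

def fingerprint(name: str, squeeze: bool = False) -> str:
    """Normalize accents, case and token order; optionally squeeze doubles."""
    s = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    key = " ".join(sorted(set(s.lower().split())))
    if squeeze:  # "Brettner" -> "Bretner"
        key = "".join(c for i, c in enumerate(key) if i == 0 or c != key[i - 1])
    return key

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sorted_neighborhood(names, window=5, max_dist=1):
    """Slide a fixed-size window over the sorted list; report close pairs."""
    keyed = sorted((fingerprint(n), n) for n in names)
    pairs = set()
    for i, (_, a) in enumerate(keyed):
        for _, b in keyed[i + 1 : i + window]:
            if a != b and levenshtein(a, b) <= max_dist:
                pairs.add(tuple(sorted((a, b))))
    return pairs

print(sorted_neighborhood(["Hans Brettner", "Hans Bretner", "Anna Schmidt"]))
# -> {('Hans Bretner', 'Hans Brettner')}
```

Sorting by a normalization key before windowing is what turns the quadratic all-pairs comparison into a linear scan; the quality of the key determines how many true duplicates land in the same window.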
A.3 Data collection using categories

We develop a methodology based on the information found in the categories of Wikipedia to approach the universe of notable individuals. We scrape individuals following a procedure based on categories, which are present in the bottom part of most biographies. These independent Wikipedia objects contain lists of individuals (and their associated URLs) who have one feature in common, such as: birth date, death date, domain of influence, etc. In a first stage, we harvest all links available in the "Living People" (https://en.Wikipedia.org/wiki/Category:Living_people) category of the English edition. In a second stage, we explore additional categories such as "Possibly living people", "Deaths (resp. births) by year", "Deaths (resp. births) by decade", "Deaths (resp. births) by century" and "Deaths (resp. births) by millennium", etc. to collect more URLs. Last, we parse the following list of categories to detect individuals that were not identified in the previous stages: "Date of birth missing", "Date of birth unknown", "Date of death missing", "Date of death unknown", "Year of birth missing", "Year of birth unknown", "Year of death missing", "Year of death unknown", "Place of birth missing", "Place of birth unknown", "Place of death missing", "Place of death unknown".
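For concreteness, here is a minimal sketch of the first harvesting stage using the public MediaWiki API. This is an assumption about the mechanism: the paper does not state how the categories were collected, and scraping rendered category pages would work equally well.

```python
# Fetch members of one Wikipedia category, following API continuations.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category: str, limit: int = 2000):
    """Yield article titles belonging to a Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    seen = 0
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            if member["ns"] == 0:  # keep articles, skip talk/user pages
                yield member["title"]
                seen += 1
                if seen >= limit:
                    return
        if "continue" not in data:
            return
        params.update(data["continue"])  # resume where the last batch ended

for title in category_members("Category:Living people", limit=5):
    print(title)
```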
A.4 Oldest registered entries and comparison with world population estimates

The first registered human in our database was born around 430,000BC, namely "Cranium 17", an ancient hominid skull. The second oldest entry, 11,000BC, is the skull of a Paleo-Indian woman discovered in Mexico City in 1959. Three other famous skeletons (the Kolebjerg Man (8000BC), Loschbur-Fra (8000BC), the Frau von Bäckaskog (7000BC)) follow.

Notes. Exhaustive sample (4.7 million individuals). The birth and death min and max are based on the precision of the related dates: millennia, centuries.

Figure S2 provides the split over 4 sub-periods of the ratio of world population to the population of notable individuals. Our database contains approximately one person out of 250,000 before 500AD. The ratio then declines continuously to one out of 50,000 in 1500AD, keeps declining apart from a local maximum in 1700 due to higher mortality in our database, and reaches a minimum of 1 over 3,200 in 1950. Afterwards, the ratio goes up again because fewer people enter the database: as people tend to become famous later in their career, the most recent years (after 1990) by construction contain only relatively young individuals; those not yet identified will enter the database in the coming decades.

We next run a regression of the log of the ratio of the population in the database to the world population over time. More precisely, denoting by t the calendar time and by ln X(t) the log ratio defined above in each year, we estimate

ln X(t) = a + b (T - t) + e(t),

where T denotes the present. The coefficient b is negative and tells us how an additional year of distance to present times leads to a percentage decline in the number of famous people relative to the world population at that time. We find on the restricted dataset that b = -0.0016465 with a standard deviation of 0.0000146. The fraction of famous people lost after tau periods is therefore 1 - (1 + b)^tau, which is 15.2% each century, 56.1% after 500 years, and 80.8% after 1000 years.
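The reported rates follow directly from the estimate; a short check, using the coefficient value quoted above:

```python
# Decline rates implied by the estimated coefficient:
# the surviving fraction after tau years is (1 + b)**tau.
b = -0.0016465
for tau in (100, 500, 1000):
    decline = 1 - (1 + b) ** tau
    print(f"after {tau:4d} years: {decline:.1%} forgotten")
# after  100 years: 15.2% forgotten
# after  500 years: 56.1% forgotten
# after 1000 years: 80.8% forgotten
```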

Notes (Figure S2). Restricted sample (at least one Wikipedia edition among the 7 European languages analyzed); see Section Extracting biographic information from a restricted sample. For a given year, the number of living individuals is calculated by summing up all individuals such that birth_date ≤ year ≤ death_date. When not available, the date of birth (resp. death) is estimated from the estimated average longevity over the period.
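A minimal sketch of the living-count rule used in these notes, with hypothetical column names:

```python
# A person counts in every year y with birth_date <= y <= death_date.
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "birth_date": [1900, 1910, 1950],
    "death_date": [1960, 1945, 2000],
})

years = range(1900, 2001)
living = pd.Series(
    {y: ((df.birth_date <= y) & (y <= df.death_date)).sum() for y in years}
)
print(living.loc[[1900, 1940, 1970]])  # -> 1, 2, 1
```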
In this section, we describe the recursive structure of the database. We list, following an iterative elimination process, the most notable individuals.

Notes. Exhaustive sample (4.7 million individuals). The acronyms We, Ea, EuAr, Sn are defined in Table 1 and correspond to groups of language editions of Wikipedia. Numbers in this table slightly differ from numbers in Table 1 in that they are based on language editions as per Wikidata. These language groups are not linguistic groups, but geographic groups. In Table 1 instead, we used language editions as they appear in Wikipedia biographies, which is more relevant for our data extraction based on the 7 language editions of Wikipedia.
In addition, English in this table includes Old English and Simple English.

We also report in Table S3 and Table 4 detailed statistics of discrepancies between sources. For a significant number of individuals, especially from ancient times, the exact year is not available. We then use the century, millennium, circa or decade information, when available, to estimate it. We build the relevant time intervals and use the middle of the interval as a proxy for the birth/death year. Overall, the exact date of birth (death) is known in more than 90% of cases (see Table 3 for exact numbers), and we are able to impute 4% of new birth dates and 14% of additional death dates. When the information is available for either the birth or the death date, we estimate longevity for the time period, gender, domain of influence and region, and predict the missing date of birth or death based on this estimated longevity. When we have information on neither the birth nor the death date, no imputation is possible and we exclude such individuals from all graphs with a time dimension, although many of them are from the 20th and 21st centuries. Table S1 in Appendix A.4 reports the list of the eldest people in the exhaustive database.
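A minimal sketch of these two imputation steps, under assumed field names and a single hypothetical longevity value (the paper estimates longevity separately by period, gender, domain and region):

```python
# (i) coarse date -> midpoint of its precision interval;
# (ii) one known endpoint -> the other predicted from estimated longevity.
PRECISION_SPAN = {"decade": 10, "century": 100, "millennium": 1000}

def midpoint_year(value: int, precision: str) -> float:
    """E.g. a century starting in 1700 -> 1750 as a proxy year."""
    span = PRECISION_SPAN.get(precision, 1)
    return value + span / 2 if span > 1 else value

def impute_missing(birth, death, estimated_longevity=62.0):
    """Fill the missing end of the lifespan from estimated longevity."""
    if birth is None and death is not None:
        return death - estimated_longevity, death
    if death is None and birth is not None:
        return birth, birth + estimated_longevity
    return birth, death  # nothing to do, or nothing imputable

print(midpoint_year(1700, "century"))  # 1750.0
print(impute_missing(None, 1820))      # (1758.0, 1820)
```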

Figure S3. Time evolution of the number of individuals in the database by gender and language editions
Notes. Cross-verified, restricted sample (at least one Wikipedia edition among the 7 European languages analyzed); see Section Extracting biographic information from a restricted sample. For a given year, the number of living individuals is calculated by summing up all individuals such that birth_date ≤ year ≤ death_date. When not available, the date of birth (resp. death) is estimated from the estimated average longevity over the period. The English (En) language group includes individuals with at least one biography in English in Wikipedia; Western non-English (We) includes individuals with a Wikipedia biography in at least one of the Western languages but absent from En. See Table 1 for precise definitions of these groups and sub-groups. Individuals with more than one biography account for one observation to avoid double counting.

In Wikipedia, keywords related to the domain of influence are found in the first part of most biographies, after verbal groups such as "was a"/"is a"/"was the"/"is the". We first parse the English edition to detect keywords in a list of 1911 occupations and select the first three keywords. In most cases, these correspond to a well-referenced occupation such as pianist or engineer. We use a frequency threshold equal to 25%, above which we keep the second occupation. This threshold value was determined in a pilot study in which we gauged the number of errors generated when using more or less constraining threshold values. A good illustration is Napoleon Bonaparte, who is referenced in two main domains: "Politics" and also "Military". Another example is Ronald Reagan, famous first for his prominent role in American politics in the 1980s and also known as an actor.

To sum up, the easy cases are when Wikipedia's and Wikidata's keywords characterizing an occupation or a domain of influence converge towards two identical modal occupations across sources. When this information diverges, we generally give more credit to Wikidata. We however make an exception to this rule when there is a tie between the modes in Wikidata and instead a clear, unique mode in Wikipedia; in this case, we favor Wikipedia. In the more problematic case in which both Wikidata and Wikipedia give several modes, we pool all keywords together and determine the mode from this combined list. A sketch of this reconciliation rule is given below.
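A compact sketch of the reconciliation rule, with hypothetical keyword lists as inputs:

```python
# Wikidata wins in general; a tie in Wikidata defers to a unique
# Wikipedia mode; a tie in both pools the keywords.
from collections import Counter

def modes(keywords):
    """All values reaching the maximum frequency."""
    counts = Counter(keywords)
    if not counts:
        return []
    top = max(counts.values())
    return [k for k, v in counts.items() if v == top]

def reconcile(wikidata_kw, wikipedia_kw):
    wd, wp = modes(wikidata_kw), modes(wikipedia_kw)
    if len(wd) == 1:
        return wd[0]                      # default: trust Wikidata
    if len(wd) > 1 and len(wp) == 1:
        return wp[0]                      # Wikidata tie, clear Wikipedia mode
    return modes(wikidata_kw + wikipedia_kw)[0]  # both tied: pool keywords

print(reconcile(["politics", "military"], ["politics", "politics", "actor"]))
# -> 'politics'
```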
For most citizenships, we make a distinction between "old regime" and "current regime" and use the acquisition-of-sovereignty information to determine whether an individual's citizenship belongs to one or the other. We proceed the same way with empires, to correctly assign individuals to either these supranational entities or to the new nation states that emerged after their collapse. Here we consider three supranational entities: the Holy Roman Empire, the Roman Empire and the Soviet Union. This grouping procedure was made necessary given both the geographical expanse of such political entities and the fact that it is almost impossible to associate them with a single modern country. Finally, a fraction of our individuals have several citizenships, and we report two of them whenever appropriate.

A reason to give more credit to Wikidata in the other cases is that the code written to extract this information from Wikipedia may introduce more mistakes, since it needs to crawl the entire content of the biography to detect one or several citizenships that do not necessarily belong to the individual. Lastly, in case the citizenship information is absent from one universe, we use the most frequent citizenship(s) found in the other universe.

A large number of individuals have two citizenships, either because they are true bi-nationals (e.g. Indian and US citizens) or because the country they were born in disappeared or separated from a larger entity (for example, Bosnia and Herzegovina from Yugoslavia in the 90's). We therefore decide to report up to two citizenships in the database for better coverage. The thresholds used to demarcate old political and geographical regimes from the modern state for each nation state are available at: https://en.wikipedia.org/wiki/List_of_sovereign_states_by_date_of_formation.

The evolution over time of median longevity is shown in Figure S10. It is computed as the difference between death year and birth year, when both are available. As in previous studies 5, we observe steady improvements in the longevity of cohorts born after 1600.
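A minimal sketch of the series behind Figure S10, with hypothetical column names and 50-year birth cohorts as an assumption:

```python
# Median longevity by birth cohort: longevity = death_year - birth_year.
import pandas as pd

df = pd.DataFrame({
    "birth_year": [1590, 1605, 1612, 1700, 1710, 1715],
    "death_year": [1640, 1660, 1671, 1770, 1782, 1790],
})
df["longevity"] = df.death_year - df.birth_year
df["cohort"] = (df.birth_year // 50) * 50  # 50-year birth cohorts
print(df.groupby("cohort")["longevity"].median())
```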

Figure S11. Age at death on English (left) and Western non-English editions (right), 1800-2000AD
Notes. Cross-verified, restricted sample (at least one Wikipedia edition among the 7 European languages analyzed); see Section Extracting biographic information from a restricted sample. The occupations are defined in Section Domains of influence and occupations. For a given year, the number of living individuals is calculated by summing up all individuals such that birth_date ≤ year ≤ death_date. When not available, the date of birth (resp. death) is estimated from the estimated average longevity over the period. In both panels, a vertical line corresponds to the distribution of the age at death at a given date. The observed color discontinuities illustrate war episodes: the American Civil War, the First World War, the Spanish Civil War, and the Second World War.

• You will be asked to report the verification in the Google sheet next to each piece of information. "Correct" means no error, "Error" means certain error, "Missing" means the information is included in Wikipedia/Wikidata but not present in the dataset, "Other case" means possible error. Judgment is required from you.

- For instance, if there is a historical controversy and several sources differ, report "Other case" unless there is an obvious mistake in our database.

- This will particularly be the case for the retained citizenship, which is sometimes selected among a list of ten or more different geographical entities.

• Cross-verification: a part of the sample is common to other research assistants, to assess the accuracy of your work. There will be an end-of-contract reward of up to 12.5% of the contract for the quality of the work.

• Remember that the goal is neither to minimize nor to maximize the number of spotted errors but to detect and provide a fair assessment of the quality of the database. Keep all your comments and suggestions on the spreadsheet, as they may be requested by editors of scientific journals. In case of doubt, report "Other case" as explained above, together with the reason for the doubt about the information contained in the database.

At the end of the pilot, we looked at the various errors detected. In particular, as regards occupations, we noticed that when the frequency of the second occupation was below 0.25, there was a large proportion of errors; we decided to set this as a threshold, since it preserves many true positives regarding the second occupation.

• We will give you 1000 individuals from various notability levels, and ask you to check and validate or report mistakes on 6 different pieces of information: exact or approximate dates of birth and death; gender; main occupation and possibly secondary occupation; citizenship or an equivalent concept for earlier periods of history.

• You will be asked to report the verification in the Google sheet next to each piece of information. "Correct" means no error, "Error" means certain error, "Missing" means the information is included in Wikipedia/Wikidata but not present in the dataset, "Other case" means possible error. Judgment is required from you.

- For instance, if there is a historical controversy and several sources differ, report "Other case" unless there is an obvious mistake in our database.

- This will particularly be the case for the retained citizenship, which is sometimes selected among a list of ten or more different geographical entities.

• Cross-verification: a part of the sample is common to other research assistants, to assess the accuracy of your work. There will be an end-of-contract reward of up to 12.5% of the contract for the quality of the work.

• Remember that the goal is neither to minimize nor to maximize the number of spotted errors but to detect and provide a fair assessment of the quality of the database. Keep all your comments and suggestions on the spreadsheet, as they may be requested by editors of scientific journals. In case of doubt, report "Other case" as explained above, together with the reason for the doubt about the information contained in the database.

Notes. This table provides the numbers and rates of discrepancy when independent RAs did not report the same outcome among Correct, Error, Missing, Other case for the same individual. The first row gives the number and the second row gives the frequency.

Notes. Test sample on a mix of the exhaustive and restricted databases (at least one Wikipedia edition among the 7 European languages analyzed), with over-sampling; see text. This table provides summary statistics on manual checks. The different possible outcomes are: "No info in sources" means the information is included neither in Wikipedia/Wikidata nor in our dataset; "Info updated since data collection" means the information is included in Wikipedia/Wikidata but not present in our dataset; "Information Correct" means no error; "Error" means certain error; "Other case" means possible error (for instance a historical controversy, several sources differing, or information updated since data collection). These checks have been conducted by the 10 RAs and cross-verified by a PhD researcher.

Notes. Test sample on the restricted database (at least one Wikipedia edition among the 7 European languages analyzed), with over-sampling of the top and of the bottom; see text. This table provides summary statistics on manual checks. The different possible outcomes are: "No info in sources" means the information is included neither in Wikipedia/Wikidata nor in our dataset; "Info updated since data collection" means the information is included in Wikipedia/Wikidata but not present in our dataset; "Information Correct" means no error; "Error" means certain error; "Other case" means possible error (for instance a historical controversy, several sources differing, or information updated since data collection). These checks have been conducted by the 10 RAs and cross-verified by a PhD researcher.

Notes. Test sample on the top 1000 most notable individuals of the database. This table provides summary statistics on manual checks. The different possible outcomes are: "No info in sources" means the information is included neither in Wikipedia/Wikidata nor in our dataset; "Info updated since data collection" means the information is included in Wikipedia/Wikidata but not present in our dataset; "Information Correct" means no error; "Error" means certain error; "Other case" means possible error (for instance a historical controversy, several sources differing, or information updated since data collection). These checks have been conducted by the 10 RAs and cross-verified by a PhD researcher.
Notes. Test sample on the subset of the restricted database (at least two Wikipedia editions among the 7 European languages analyzed). This table provides summary statistics on manual checks. The different possible outcomes are: "No info in sources" means the information is included neither in Wikipedia/Wikidata nor in our dataset; "Info updated since data collection" means the information is included in Wikipedia/Wikidata but not present in our dataset; "Information Correct" means no error; "Error" means certain error; "Other case" means possible error (for instance a historical controversy, several sources differing, or information updated since data collection). These checks have been conducted by the 10 RAs and cross-verified by a PhD researcher.

Notes. Test sample on those with no Wikipedia edition among the 7 European languages analyzed. This table provides summary statistics on manual checks. The different possible outcomes are: "No info in sources" means the information is included neither in Wikipedia/Wikidata nor in our dataset; "Info updated since data collection" means the information is included in Wikipedia/Wikidata but not present in our dataset; "Information Correct" means no error; "Error" means certain error; "Other case" means possible error (for instance a historical controversy, several sources differing, or information updated since data collection). These checks have been conducted by the 10 RAs and cross-verified by a PhD researcher.